Reduced-Latency Soft-In/Soft-Out Module



Related Application 
This application claims the benefit of, and incorporates
herein by reference, U.S. Provisional Patent Application No.
60/201,583, filed May 3, 2000.

Origin of Invention 
The research and development described herein were supported
by the National Science Foundation under grant number
NCR-CCR-9726391. The US government may have certain rights in
the claimed inventions.



Field of the Invention

The present application describes techniques for iterative
detection or decoding of concatenated codes. More specifically,
this application relates to computing the soft-inverse of a
finite-state machine (i.e., the soft-in/soft-out or "SISO"
module), for example, such as used in turbo decoders.

Background

Due to the presence of noise on a communications channel,
the signal received is not always identical to the signal
transmitted. Channel coding, or equivalently, error correction
coding, relates to techniques for increasing the probability that
a receiver in a communications system will be able to correctly
detect the composition of the transmitted data stream.
Typically, this is accomplished by encoding the signal to add
redundancy before it is transmitted. This redundancy increases
the likelihood that the receiver will be able to correctly decode
the encoded signal and recover the original data.

Turbo coding, a relatively new but widely used channel
coding technique, has made signaling at power efficiencies close
to the theoretical limits possible. The features of a turbo code
include parallel code concatenation, non-uniform interleaving,
and iterative decoding. Because turbo codes may substantially
improve energy efficiency, they are attractive for use over
channels that are power and/or interference limited.

A turbo decoder may be used to decode the turbo code. The
turbo decoder may include two soft-input/soft-output (SISO)
decoding modules that work together in an iterative fashion.
The SISO decoding module is the basic building block for
established iterative detection techniques for a system having a
network of finite state machines, or more generally, subsystems.

Figs. 7A and 7B respectively show block diagrams of typical
turbo encoder 10 and turbo decoder 11 arrangements. In this
example, the turbo encoder 10 uses two separate encoders, RSC1
and RSC2, each a Recursive Systematic Convolutional encoder.
Each of the encoders RSC1 and RSC2 can be modeled as a finite
state machine (FSM) having a certain number of states (typically,
either four or eight for turbo encoders) and transitions
therebetween. To encode a bit stream for transmission over a
channel, uncoded data bits $b_k$ are input both to the first
encoder RSC1 and an interleaver I. The interleaver I shuffles the
input sequence $b_k$ to increase randomness and introduces the
shuffled sequence $a_k$ to the second encoder RSC2. The outputs of
encoders RSC1 and RSC2, $c_k$ and $d_k$ respectively, are
punctured and modulated by block 12 to produce encoded outputs
$x_k(0)$ and $x_k(1)$, which are transmitted over a channel to the
decoder 11.

At the decoder 11, the encoded outputs are received as noisy
inputs $z_k(0)$ and $z_k(1)$ and are de-modulated and de-punctured
by block 13, which is the soft-inverse of puncture and modulation
block 12. The output of block 13, $M[c_k(0)]$, $M[c_k(1)]$ and
$M[d_k(1)]$, is "soft" information - that is, guesses or beliefs
about the most likely sequence of information bits to have been
transmitted in the coded sequences $c_k(0)$, $c_k(1)$ and
$d_k(1)$, respectively. The decoding process continues by passing
the received soft information $M[c_k(0)]$, $M[c_k(1)]$ to the first
decoder, SISO1, which makes an estimate of the transmitted
information to produce soft output $M[b_k]$ and passes it to
interleaver I (which uses the same interleaving mapping as the
interleaver I in the encoder 10) to generate $M[a_k]$. The second
decoder, SISO2, uses $M[a_k]$ and received soft information
$M[d_k(1)]$ to re-estimate the information. This second estimation
is looped back, via the soft inverse of the interleaver, $I^{-1}$,
to SISO1 where the estimation process starts again. The iterative
process continues until certain conditions are met, such as a
certain number of iterations having been performed, at which point
the final soft estimates become "hard" outputs representing the
transmitted information.
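
The relationship between the interleaver I and its soft inverse $I^{-1}$ can be illustrated with a short sketch. It assumes, purely for illustration, that the interleaver is a fixed pseudo-random permutation of the block; the function names, the seed, and the list representation of the soft information are hypothetical and are not taken from the described decoder.

```python
import random


def make_interleaver(n, seed=0):
    """Return a fixed pseudo-random permutation of range(n) (illustrative only)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def interleave(values, perm):
    """Apply I: position i of the output takes the value at position perm[i]."""
    return [values[p] for p in perm]


def deinterleave(values, perm):
    """Apply I^{-1}: undo interleave() by writing each value back to perm[i]."""
    out = [None] * len(perm)
    for i, p in enumerate(perm):
        out[p] = values[i]
    return out
```

Because both decoders must agree on the ordering of the soft information, the same permutation is used for the interleaver on the path from SISO1 to SISO2 and, inverted, on the path back from SISO2 to SISO1.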

Each of the SISO decoder modules in the decoder 11 is the
soft-inverse of its counterpart RSC encoder in the encoder 10.
The conventional algorithm for computing the soft-inverse is
known as the "forward-backward" algorithm such as described in G.
David Forney, Jr., "The Forward-Backward Algorithm," Proceedings
of the 34th Allerton Conference on Communications, Control and
Computing, pp. 432-446 (Oct. 1996). In the forward-backward
algorithm, an estimate as to a value of a data bit is made by
recursively computing the least cost path (using
add-compare-select, or ACS, operations) forwards and backwards
through the SISO's "trellis" - essentially an unfolded state
diagram showing all possible transitions between states in a FSM.
Each path through the SISO's trellis has a corresponding cost,
based on the received noisy inputs, representing a likelihood that
the RSC took a particular path through its trellis when encoding
the data. Typically, the lower a path's total cost, the higher the
probability that the RSC made the corresponding transitions in
encoding the data. In general, the forward and backward ACS
recursions performed by a SISO can be computed either serially or
in parallel. Performing the recursions in parallel, the faster
of the two methods, yields an architecture with latency O(N),
where N is the block size of the encoded data. As used herein,
"latency" is the end-to-end delay for decoding a block of N bits.

The present inventors recognized that, depending on the
latency of other components in a decoder or other detection
system, reducing the latency of a SISO could result in a
significant improvement in the system's throughput (a measurement
of the number of bits per second that a system architecture can
process). Consequently, the present inventors developed a
tree-structured SISO that can provide reduced latency.

Summary

Implementations of the tree-structured SISO may include 
various combinations of the following features. 

In one aspect, decoding an encoded signal (for example, a
turbo encoded signal, a block encoded signal or the like) can be
performed, e.g., in a wireless communications system, by
demodulating the received encoded signal to produce soft
information, and iteratively processing the soft information with
one or more soft-in/soft-out (SISO) modules. At least one
of the SISO modules uses a tree structure to compute forward and
backward state metrics, for example, by performing recursive
marginalization-combining operations, which may in various
embodiments include min-sum operations, min*-sum operations
(where $\min{}^*(x, y) = \min(x, y) - \ln(1 + e^{-|x-y|})$),
sum-product operations, and/or max-product operations.
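
The min* operator named above can be checked numerically. The following is a minimal sketch, not part of the described implementations; it assumes metrics are negative log-probabilities, in which case min* equals the negative log of the sum of the two underlying probabilities. The function names are illustrative.

```python
import math


def min_star(x: float, y: float) -> float:
    """min*(x, y) = min(x, y) - ln(1 + exp(-|x - y|))."""
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))


def neg_log_sum(x: float, y: float) -> float:
    """Reference form: -ln(exp(-x) + exp(-y)), which min* computes stably."""
    return -math.log(math.exp(-x) + math.exp(-y))
```

When the two metrics differ greatly, the correction term vanishes and min* approaches the plain min used in min-sum processing.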

The encoded signal may comprise at least one of a turbo
encoded signal, a block turbo encoded signal, a low density
parity check coded signal, a product coded signal, a
convolutional coded signal, a parallel concatenated convolutional
coded signal, and/or a serial concatenated convolutional coded
signal.

The iterative processing may be terminated upon occurrence
of a predetermined condition, for example, the completion of a
predetermined number of iterations. The iterative processing may
include performing parallel prefix operations or parallel suffix
operations, or both, on the soft information. Moreover, the
iterative processing may include using soft output of a first
SISO as soft input to another SISO, and/or may include performing
marginalization-combining operations which form a semi-ring over
the soft information.

The tree structure used by at least one SISO may be a tree
structure that results in the SISO having a latency of
$O(\log_2 N)$, where N is a block size, a Brent-Kung tree, or a
forward-backward tree, e.g., having a tree structure recursion
that is bi-directional.

Processing performed by at least one SISO may include tiling
an observation interval into subintervals, and applying a minimum
half-window SISO operation on each subinterval.

In another aspect, a SISO module may include a plurality of
fusion modules arranged into a tree structure and adapted to
compute forward and backward state metrics. Each fusion module
may be defined by the equation:

$$C(k_0, k_1) = C(k_0, m) \otimes_C C(m, k_1) \iff C(s_{k_0}, s_{k_1}) = \min_{s_m} \left[ C(s_{k_0}, s_m) + C(s_m, s_{k_1}) \right] \quad \forall\, s_{k_0}, s_{k_1}$$

where C(k, m) is a matrix of minimum sequence metrics (MSM) of
state pairs $s_k$ and $s_m$ based on soft-inputs between $s_k$ and
$s_m$. At least one of the fusion modules may compute forward and
backward state metrics by performing recursive
marginalization-combining operations.

In another aspect, a SISO module may include one or more
complete fusion modules (CFMs) for performing
marginalization-combining operations in both a forward direction
and a backward direction, one or more forward fusion modules
(fFMs) for performing marginalization-combining operations only in
the forward direction, and one or more backward fusion modules
(bFMs) for performing marginalization-combining operations only in
the backward direction. The one or more CFMs, fFMs, and bFMs are
arranged into a tree structure (e.g., Brent-Kung tree,
forward-backward tree). The number of CFMs may be set to the
minimum number needed to compute a soft-inverse. In general, fFMs
and bFMs may be used in the tree structure in place of CFMs
wherever possible.

In another aspect, iterative detection may include receiving
an input signal (e.g., a turbo encoded signal or a convolutional
coded signal) corresponding to one or more outputs of a finite
state machine (FSM), and determining the soft inverse of the FSM
by computing forward and backward state metrics of the received
input signal using a tree structure. The forward and backward
state metrics may be computed by one or more SISO modules, for
example, using a tree-structured set of marginalization-combining
operations.

Determining the soft inverse of the FSM may include
iteratively processing soft information, for example, performing
parallel prefix operations or parallel suffix operations, or both,
on the soft information. Moreover, the iterative processing may
include using soft output of a first SISO as soft input to
another SISO. At least one SISO further may tile an observation
interval into subintervals, and apply a minimum half-window SISO
operation on each subinterval.

In another aspect, a turbo decoder includes a demodulator
adapted to receive as input a signal encoded by a FSM and to
produce soft information relating to the received signal, and at
least one SISO module in communication with the demodulator and
adapted to compute a soft-inverse of the FSM using a tree
structure. The tree structure may implement a combination of
parallel prefix and parallel suffix operations.

The turbo decoder may include at least two SISO modules in
communication with each other. In that case, the SISO modules
may iteratively exchange soft information estimates of the
decoded signal. In any event, at least one SISO may compute the
soft-inverse of the FSM by computing forward and backward state
metrics of the received signal.

In another aspect, iterative detection may be performed by
receiving an input signal corresponding to output from one or
more block encoding modules, and determining the soft inverse of
the one or more block encoding modules by computing forward and
backward state metrics of the received input signal using a tree
structure. The input signal may include a block turbo encoded
signal, a low density parity check coded signal, and/or a product
coded signal.

Determining the soft inverse of the block encoding module
may include iteratively processing soft information, for example,
performing parallel prefix operations or parallel suffix
operations, or both, on the soft information.

In another aspect, a block decoder may include a demodulator
adapted to receive as input a signal encoded by a block encoding
module and to produce soft information relating to the received
signal, and at least one SISO module in communication with the
demodulator and adapted to compute a soft-inverse of the block
encoding module using a tree structure. The tree structure used
may implement a combination of parallel prefix and parallel
suffix operations. The block decoder may further include at
least two SISO modules in communication with each other, wherein
the SISO modules iteratively exchange soft information estimates
of the decoded signal. In any event, at least one SISO may
compute the soft-inverse of the block encoding module by
computing forward and backward state metrics of the received
signal.

In another aspect, iterative detection may include receiving
an input signal (e.g., a block error correction encoded signal, a
block turbo encoded signal, a low density parity check coded
signal, and/or a product coded signal) corresponding to one or
more outputs of a module whose soft-inverse can be computed by
running the forward-backward algorithm on a trellis
representation of the module (e.g., a FSM or a block encoding
module), and determining the soft inverse of the module by
computing forward and backward state metrics of the received
input signal using a tree structure.

One or more of the following advantages may be provided. The
techniques and methods described here result in a tree-structured
SISO module that can have a reduced latency as low as O(lg N)
[where N is the block size and "lg" denotes $\log_2$] in contrast
to conventional SISOs performing forward-backward recursions in
parallel, which have latency O(N). The decrease in latency comes
primarily at a cost of chip area, with, in some cases, only a
marginal increase in computational complexity. This
tree-structured SISO can be used to design a very high throughput
turbo decoder, or more generally an iterative detector. Various
sub-windowing and tiling schemes also can be used to further
improve latency.

These reduced-latency SISOs can be used in virtually any 
environment where it is desirable or necessary to run the 
forward-backward algorithm on a trellis. For example, the tree- 
SISO finds application in wireless communications and in many 
types of decoding (turbo decoding, block turbo decoding, 
convolutional coding including both parallel concatenated 
convolutional codes and serial concatenated convolutional codes, 
parity check coding including low density parity check (LDPC) 
codes, product codes and more generally in iterative detection).
The reduced-latency SISOs are particularly advantageous when 
applied to decode FSMs having a relatively small number of 
states, such as turbo code FSMs which typically have either 4 or 
8 states. 

The details of one or more embodiments are set forth in the 
accompanying drawings and the description below. Other features, 
objects, and advantages of the invention will be apparent from 
the description and drawings, and from the claims.

Drawing Descriptions

Fig. 1 shows the C fusion process.

Fig. 2 shows a fusion module array for combining the
complete set of C matrices on $[k_0, k_0 + K]$ and
$[k_0 + K, k_0 + 2K]$ to obtain the complete set on
$[k_0, k_0 + 2K]$.

Fig. 3 shows a tree-SISO architecture for N = 16.

Fig. 4 shows a tiled sub-window scheme based on the
forward-backward algorithm.

Fig. 5 shows a tiled sub-window approach with 4 tree-SISOs
of window size 4 for N = 16 to implement a d = 4 MHW SISO.

Fig. 6 is a graph showing a simulation of a standard turbo
code for SISOs with various half-window sizes, N = 1024, and ten
iterations.

Figs. 7A and 7B show examples of a turbo encoder and a turbo
decoder, respectively.

Figs. 8A and 8B show a forward-backward tree-SISO (FBT-SISO)
implementation.

Fig. 9 shows an embodiment of a Brent-Kung Tree-SISO.

Fig. 10 is a graph showing transistor count estimations for
the Brent-Kung Tree-SISO approach of Fig. 9.

Fig. 11 shows the trellis structure for the parity check for
a standard Hamming (7,4) code.

Fig. 12 shows a tree structured representation of the
Hamming code parity check, in which the forward-backward-tree-SISO
algorithm may be run on this tree with an inward (up) set of
message passes to the node labeled V6, followed by an outward
(down) set of message passes to produce the desired soft-out
metrics on the coded bits.

Fig. 13 shows an example of an LDPC decoder (overall code
rate of 0.5) in which beliefs are passed from the broadcaster
nodes to the parity check nodes through a fixed permutation, and
in which the parity check nodes can be implemented as a
tree-structured SISO.

Detailed Description

The present inventors have developed a re-formulation of the
standard SISO computation using a combination of prefix and
suffix operations. This architecture - referred to as a
Tree-SISO - is based on tree-structures for fast parallel prefix
computations used in Very Large Scale Integration (VLSI)
applications such as fast-adders. Details of the Tree-SISO and
various alternative embodiments follow.

Calculating the "soft-inverse" of a finite-state machine
(FSM) is a key operation in many data detection/decoding
algorithms. One of the more prevalent applications is iterative
decoding of concatenated codes, such as turbo codes as described
in Berrou, et al., "Near Shannon limit error-correcting coding
and decoding: turbo-codes," International Conference on
Communications, (Geneva, Switzerland), pp. 1064-1070, May 1993;
and Berrou, et al., "Near optimum error correcting coding and
decoding: turbo-codes," IEEE Trans. Commun., vol. 44, pp.
1261-1271, October 1996. However, the SISO (soft-in/soft-out)
module is widely applicable in iterative and non-iterative
receivers and signal processing devices. The soft-outputs
generated by a SISO may also be thresholded to obtain optimal hard
decisions (e.g., producing the same decisions as the Viterbi
algorithm (G. D. Forney, "The Viterbi algorithm," Proc. IEEE,
vol. 61, pp. 268-278, March 1973) or the Bahl algorithm (L. R.
Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of
linear codes for minimizing symbol error rate," IEEE Trans.
Inform. Theory, vol. IT-20, pp. 284-287, March 1974)). The general
trend in many applications is towards higher data rates and
therefore fast algorithms and architectures are desired.

There are two basic performance (speed) aspects of a data
detection circuit architecture. The first is throughput, which is
a measurement of the number of bits per second the architecture
can decode. The second is latency, which is the end-to-end delay
for decoding a block of N bits. Non-pipelined architectures are
those that decode only one block at a time and for which the
throughput is simply N divided by the latency. Pipelined
architectures, on the other hand, may decode multiple blocks
simultaneously, shifted in time, thereby achieving much higher
throughput than their non-pipelined counterparts.

Depending on the application, the throughput and/or latency
of the data detection hardware may be important. For example, the
latency associated with interleaving in a turbo-coded system with
relatively low data rate (less than 100 Kb/s) will likely dominate
the latency of the iterative decoding hardware. For future
high-rate systems, however, the latency due to the interleaver may
become relatively small, making the latency of the decoder
significant. While pipelined decoders can often achieve the
throughput requirements, such techniques generally do not
substantially reduce latency. In addition, sometimes latency has
a dramatic impact on overall system performance. For example, in
a data storage system (e.g., magnetic hard drives), latency in
the retrieval process has a dramatic impact on the performance of
the microprocessor and the overall computer. Such magnetic
storage channels may use high-speed Viterbi processing with
turbo-coded approaches.

The standard SISO algorithm is the forward-backward
algorithm. The associated forward and backward recursion steps
can be computed in parallel for all of the FSM states at a given
time, yielding an architecture with O(N) computational complexity
and latency, where N is the block size. The key result of this
paper is the reformulation of the standard SISO computation using
a combination of prefix and suffix operations, which leads to an
architecture with O(lg N) latency. This architecture, referred to
as a "tree-SISO", is based on a tree-structure for fast parallel
prefix computations in the Very Large Scale Integration (VLSI)
literature (e.g., fast adders).

This exponential decrease in latency for the tree-SISO comes
at the expense of increased computational complexity and area.
The exact value of these costs depends on the FSM structure
(e.g., the number of states) and the details of the
implementation. However, for a four-state convolutional code,
such as those often used as constituent codes in turbo codes, the
tree-SISO architecture achieves O(lg N) latency with
computational complexity of O(N lg N). Note that, for this
four-state example, the computational complexity of the tree-SISO
architecture increases sublinearly with respect to the associated
speed-up. This is better than linear-scale solutions to the
Viterbi algorithm (e.g., such as described in Fettweis, et al.,
"Parallel Viterbi algorithm implementation: Breaking the
ACS-bottleneck," IEEE Trans. Commun., vol. 37, pp. 785-790, August
1989), the generalization of which to the SISO problem is not
always clear. For this 4-state code example, the area associated
with the O(lg N)-latency tree-SISO is O(N).

Soft-In Soft-Out Modules

Consider a specific class of finite state machines with no
parallel state transitions and a generic S-state trellis. Such a
trellis has up to S transitions departing and entering each
state. The FSM is defined by the labeling of the state
transitions by the corresponding FSM input and FSM output. Let
$t_k = (s_k, a_k, s_{k+1}) = (s_k, a_k) = (s_k, s_{k+1})$ be a
trellis transition from state $s_k$ at time k to state $s_{k+1}$
in response to input $a_k$. Since there are no parallel state
transitions, $t_k$ is uniquely defined by any of these
representations. Given that the transition $t_k$ occurs, the FSM
output is $x_k(t_k)$. Note that for generality the mapping from
transitions to outputs is allowed to be dependent on k.

Consider the FSM as a system that maps a digital input
sequence $a_k$ to a digital output sequence $x_k$. A marginal
soft-inverse, or SISO, of this FSM can be defined as a mapping of
soft-in (SI) information on the inputs $\mathrm{SI}(a_k)$ and
outputs $\mathrm{SI}(x_k)$ to soft-output (SO) information for
$a_k$ and/or $x_k$. The mapping is defined by the combining and
marginalization operators used. It is now well understood that one
need only consider one specific reasonable choice for
marginalization and combining operators and the results easily
translate to other operators of interest. Thus, the focus is on
the min-sum marginalization-combining operation with the results
translated to max-product, sum-product, min*-sum, and max*-sum in
a standard fashion. In all cases, let the indices $K_1$ and $K_2$
define the time boundaries of the combining window or span used in
the computation of the soft-output for a particular quantity $u_k$
(e.g., $u_k = s_k$, $u_k = a_k$, $u_k = t_k$, $u_k = x_k$,
$u_k = (s_k, s_{k+d})$, etc.). In general, $K_1$ and $K_2$ are
functions of k. For notational compactness, this dependency is not
explicitly denoted. For min-sum marginalization-combining, the
minimum sequence metric (MSM) of a quantity $u_k$ is the metric (or
length) of the shortest path or sequence in a combining window or
span that is consistent with the conditional value of $u_k$.
Specifically, the MSM is defined as follows. As is the standard
convention, the metric of a transition that cannot occur under
the FSM structure is interpreted to be infinity.

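$$\mathrm{MSM}_{K_1}^{K_2}(u_k) \triangleq \min_{t_{K_1}^{K_2} : u_k} M_{K_1}^{K_2}\left(t_{K_1}^{K_2}\right) \tag{1}$$

$$M_{K_1}^{K_2}\left(t_{K_1}^{K_2}\right) \triangleq \sum_{m=K_1}^{K_2} M_m(t_m) \tag{2}$$

$$M_m(t_m) \triangleq \mathrm{SI}(a_m) + \mathrm{SI}(x_m(t_m)) \tag{3}$$

where the set of transitions starting at time $K_1$ and ending at
time $K_2$ that are consistent with $u_k$ is denoted
$t_{K_1}^{K_2} : u_k$, and $t_{K_1}^{K_2}$ implicitly defines a
sequence of transitions $t_{K_1}, t_{K_1+1}, \ldots, t_{K_2}$.
Depending on the specific application, one or both of the
following "extrinsic" quantities will be computed:

$$\mathrm{SO}_{K_1}^{K_2}(x_k) \triangleq \mathrm{MSM}_{K_1}^{K_2}(x_k) - \mathrm{SI}(x_k) \tag{4}$$

$$\mathrm{SO}_{K_1}^{K_2}(a_k) \triangleq \mathrm{MSM}_{K_1}^{K_2}(a_k) - \mathrm{SI}(a_k) \tag{5}$$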

Because the system on which the SISO is defined is an FSM,
the combining and marginalization operations in (2)-(3) can be
computed efficiently. The traditional approach is the
forward-backward algorithm, which computes the MSM of the states
recursively forward and backward in time. Specifically, for the
standard fixed-interval algorithm based on soft-in for
transitions $t_k$, k = 0, 1, ..., N-1, the following recursion
based on add-compare-select (ACS) operations results:

$$f_k(s_{k+1}) \triangleq \mathrm{MSM}_0^k(s_{k+1}) = \min_{t_k : s_{k+1}} \left[ f_{k-1}(s_k) + M_k(t_k) \right] \tag{6}$$

$$b_k(s_k) \triangleq \mathrm{MSM}_k^{N-1}(s_k) = \min_{t_k : s_k} \left[ b_{k+1}(s_{k+1}) + M_k(t_k) \right] \tag{7}$$

where $f_{-1}(s_0)$ and $b_N(s_N)$ are initialized according to
available edge information. Note that, since there are S possible
values for the state, these state metrics can be viewed as
(S × 1) vectors $f_k$ and $b_k$. The final soft-outputs in (4)-(5)
are obtained by marginalizing over the MSM of the transitions
$t_k$

$$\mathrm{SO}_0^{N-1}(u_k) = \min_{t_k : u_k} \left[ f_{k-1}(s_k) + M_k(t_k) + b_{k+1}(s_{k+1}) \right] - \mathrm{SI}(u_k) \tag{8}$$

where $u_k$ is either $x_k$ or $a_k$. The operation in equation (8)
is referred to as a completion operation.
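
The recursions (6)-(7) and the completion step (8) can be sketched in a few lines. This is an illustrative min-sum version under stated assumptions, not the hardware architecture: the transition metrics are assumed to be supplied as a list `M` of S-by-S tables with `M[k][i][j]` the metric of the transition from state i at time k to state j at time k+1 (infinity for transitions that cannot occur), and `label[i][j]` is a hypothetical table giving the value of $u_k$ carried by that transition.

```python
INF = float("inf")


def forward_backward(M, f_init, b_init):
    """Min-sum forward and backward state metrics.

    f[k] holds f_{k-1}(s_k) and b[k] holds b_k(s_k), for k = 0..N.
    """
    n_steps, n_states = len(M), len(f_init)
    f = [list(f_init)]
    for k in range(n_steps):  # Eq. (6): forward ACS recursion
        f.append([min(f[k][i] + M[k][i][j] for i in range(n_states))
                  for j in range(n_states)])
    b = [None] * n_steps + [list(b_init)]
    for k in range(n_steps - 1, -1, -1):  # Eq. (7): backward ACS recursion
        b[k] = [min(b[k + 1][j] + M[k][i][j] for j in range(n_states))
                for i in range(n_states)]
    return f, b


def complete(M, f, b, k, label, si):
    """Eq. (8): extrinsic soft-output for each value u of u_k at time k."""
    n_states = len(f[0])
    msm = {}
    for i in range(n_states):
        for j in range(n_states):
            u = label[i][j]
            if u is None:  # transition carries no value of u_k
                continue
            total = f[k][i] + M[k][i][j] + b[k + 1][j]
            msm[u] = min(msm.get(u, INF), total)
    return {u: msm[u] - si[u] for u in msm}
```

Each step of either loop depends on the previous one, which is the serial dependency that gives the forward-backward algorithm its O(N) latency.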

While the forward-backward algorithm is computationally
efficient, straightforward implementations of it have large
latency (i.e., O(N)) due to the ACS bottleneck in computing the
causal and anticausal state MSMs.

Prefix and Suffix Operations

A prefix operation is defined as a generic form of
computation that takes in n inputs $y_0, y_1, \ldots, y_{n-1}$ and
produces n outputs $z_0, z_1, \ldots, z_{n-1}$ according to the
following:

$$z_0 = y_0 \tag{9}$$

$$z_i = y_0 \otimes \cdots \otimes y_i, \tag{10}$$

where $\otimes$ is any associative binary operator.

Similarly, a suffix operation can be defined as a generic
form of computation that takes in n inputs
$y_0, y_1, \ldots, y_{n-1}$ and produces n outputs
$z_0, z_1, \ldots, z_{n-1}$ according to

$$z_{n-1} = y_{n-1} \tag{11}$$

$$z_i = y_i \otimes \cdots \otimes y_{n-1}, \tag{12}$$

where $\otimes$ is any associative binary operator. Notice that a
suffix operation is simply a (backward) prefix operation anchored
at the other edge.
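
Definitions (9)-(12) can be illustrated with a log-depth scan. The sketch below is a software simulation that uses a Hillis-Steele (Kogge-Stone style) schedule rather than the Brent-Kung layout discussed elsewhere in this description; it is chosen here only because it is short. The function names are illustrative, and `levels` counts the dependent steps that a parallel implementation would need.

```python
def prefix_scan(y, op):
    """Return (z, levels) with z[i] = y[0] (x) ... (x) y[i].

    `op` must be associative; it need not be commutative, so operand
    order is preserved. All updates within one level are independent.
    """
    z = list(y)
    d, levels = 1, 0
    while d < len(z):
        # z[i - d] covers the d-wide block immediately to the left of z[i].
        z = [z[i] if i < d else op(z[i - d], z[i]) for i in range(len(z))]
        d *= 2
        levels += 1
    return z, levels


def suffix_scan(y, op):
    """Return (z, levels) with z[i] = y[i] (x) ... (x) y[n-1].

    Implemented as a prefix scan on the reversed input with the
    operands swapped, i.e., a prefix operation anchored at the other edge.
    """
    flipped, levels = prefix_scan(y[::-1], lambda a, b: op(b, a))
    return flipped[::-1], levels
```

String concatenation is a convenient associative but non-commutative test operator: scanning the eight letters of "abcdefgh" takes three levels, matching lg 8.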

Prefix and suffix operations are important since they enable
a class of algorithms that can be implemented with low latency
using tree-structured architectures. The most notable
realizations of this concept are VLSI N-bit tree adders with
latency O(lg N).

Reformulation of the SISO Operation

The proposed low-latency architecture is derived by
formulating the SISO computations in terms of a combination of
prefix and suffix operations. To obtain this formulation, define
$C(s_k, s_m)$, for m > k, as the MSM of state pairs $s_k$ and
$s_m$ based on the soft-inputs between them, i.e.,
$C(s_k, s_m) = \mathrm{MSM}_k^{m-1}(s_k, s_m)$. The set of MSMs
$C(s_k, s_m)$ can be considered an (S × S) matrix C(k, m).
The causal state MSMs $f_{k-1}$ can be obtained from C(0, k) by
marginalizing (e.g., minimizing) out the condition on $s_0$. The
backward state metrics can be obtained in a similar fashion.
Specifically, for each conditional value of $s_k$

$$f_{k-1}(s_k) = \min_{s_0} C(s_0, s_k) \tag{13}$$
$$b_k(s_k) = \min_{s_N} C(s_k, s_N) \tag{14}$$

With this observation, one key step of the algorithm is to
compute C(0, k) and C(k, N) for k = 0, 1, ..., N-1. Note that the
inputs of the algorithm are the one-step transition metrics, which
can be written as C(k, k+1) for k = 0, 1, ..., N-1. To show how
this algorithm can be implemented with a prefix and suffix
computation, a min-sum fusion operator on C matrices is defined
that inputs two such matrices, one with a left-edge coinciding
with the right-edge of the other, and marginalizes out the
midpoint to obtain a pairwise state-MSM with larger span.
Specifically, given $C(k_0, m)$ and $C(m, k_1)$, a C Fusion
Operator, or $\otimes_C$ operator, is defined by

$$C(k_0, k_1) = C(k_0, m) \otimes_C C(m, k_1) \iff C(s_{k_0}, s_{k_1}) = \min_{s_m} \left[ C(s_{k_0}, s_m) + C(s_m, s_{k_1}) \right] \quad \forall\, s_{k_0}, s_{k_1} \tag{15}$$

Note that the $\otimes_C$ operator is an associative binary
operator that accepts two matrices and returns one matrix. This is
illustrated in Fig. 1. With this definition, C(0, k) and C(k, N)
for k = 0, 1, ..., N-1 can be computed using the prefix and suffix
operations as follows:

$$C(0, k) = C(0, 1) \otimes_C C(1, 2) \otimes_C \cdots \otimes_C C(k-1, k)$$

$$C(k, N) = C(k, k+1) \otimes_C \cdots \otimes_C C(N-2, N-1) \otimes_C C(N-1, N)$$
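
As a concrete illustration of the C fusion operator, the following sketch treats each C matrix as a nested list and implements the fusion as a min-plus matrix product. It computes the prefix and suffix products sequentially for clarity, so it shows what is computed rather than the low-latency schedule. The function names are hypothetical, and the marginalizations corresponding to (13) and (14) assume uniform (zero) edge information on $s_0$ and $s_N$.

```python
INF = float("inf")  # metric of a transition that cannot occur


def fuse(A, B):
    """C fusion: out[i][j] = min over midpoint state m of A[i][m] + B[m][j]."""
    n = len(A)
    return [[min(A[i][m] + B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def prefix_C(one_step):
    """Given [C(0,1), ..., C(N-1,N)], return [C(0,1), C(0,2), ..., C(0,N)]."""
    out = [one_step[0]]
    for C in one_step[1:]:
        out.append(fuse(out[-1], C))
    return out


def suffix_C(one_step):
    """Given [C(0,1), ..., C(N-1,N)], return [C(0,N), C(1,N), ..., C(N-1,N)]."""
    out = [one_step[-1]]
    for C in reversed(one_step[:-1]):
        out.append(fuse(C, out[-1]))
    return out[::-1]


def forward_metrics(C0k):
    """Eq. (13): minimize C(s_0, s_k) over s_0 for each s_k."""
    n = len(C0k)
    return [min(C0k[i][j] for i in range(n)) for j in range(n)]


def backward_metrics(CkN):
    """Eq. (14): minimize C(s_k, s_N) over s_N for each s_k."""
    return [min(row) for row in CkN]
```

Because the fusion is associative, the last prefix output and the first suffix output both equal C(0, N) regardless of how the intermediate products are grouped.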

In general, a SISO algorithm can be based on the decoupling
property of state-conditioning. Specifically, conditioning on all
possible FSM state values at time k, the shortest path problems
(e.g., MSM computation) on either side of this state condition
may be solved independently and then fused together (e.g., as
performed by the C-fusion operator). More generally, the SISO
operation can be decoupled based on a partition of the
observation interval with each subinterval processed
independently and then fused together. For example, the
forward-backward algorithm is based on a partition to the
single-transition level with the fusing taking place sequentially
in the forward and backward directions. In contrast, other SISO
algorithms may be defined by specifying the partition and a
schedule for fusing together the solutions to the sub-problems.
This may be viewed as specifying an association scheme to the
above prefix-suffix operations (i.e., grouping with parentheses).

The C-fusion operations may be simplified in some cases
depending on the association scheme. For example, the
forward-backward algorithm replaces all C-fusion operations by the
much simpler forward and backward ACSs. However, latency is also a
function of the association scheme. An architecture based on a
pairwise tree-structured grouping is presented below. This
structure allows only a small subset of the C-fusion operations
to be simplified, but facilitates a significant reduction in
latency compared to the forward-backward algorithm, by fusing
solutions to the subproblems in a parallel, instead of
sequential, manner.

Low-Latency Tree-SISO Architectures 

Many low-latency parallel architectures based on binary
tree-structured groupings of prefix operations can be adapted to
SISOs, such as described in Brent, et al., "A regular layout for
parallel adders," IEEE Transactions on Computers, vol. C-31, pp.
260-264, March 1982; T. H. Cormen, et al., Introduction to
Algorithms. Cambridge, Mass.: The MIT Press, 1990; and A. E.
Despain, "Chapter 2: Notes on computer architecture for high
performance," in New Computer Architectures, Academic Press,
1984. All of these have targeted n-bit adder design where the
binary associative operator is a simple 1-bit addition. The
present inventors were the first to recognize that parallel
prefix-suffix architectures could be applied to an algorithm
based on binary associative operators that are substantially more
complex than 1-bit addition. Conventional parallel prefix
architectures trade reduced area for higher latency and account
for a secondary restriction of limited fanout of each
computational module. This latter restriction is important when
the computational modules are small and have delay comparable to
the delay of wires and buffers (e.g., in adder design). The
fusion operators, however, are relatively large. Consequently,
given current VLSI trends, it is believed that they will dominate
the overall delay for the near future. Thus, an architecture
described herein minimizes latency with the minimal number of
computational modules without regard to fanout.

Specifically, the forward and backward metrics, $f_{k-1}$ and
$b_{N-k}$, for k = 1, 2, ..., N can be obtained using a
hierarchical tree-structure based on the fusion-module (FM) array
shown in Fig. 2. A complete set of C matrices on the interval
{k0, k0+K} is defined as the 2K-1 matrices C(k0, k0+m) and
C(k0+m, k0+K) for m = 1, 2, ..., K-1 along with C(k0, k0+K). This
is the MSM information for all state pairs on the span of K steps
in the trellis with one state being either on the left or right
edge of the interval. The module in Fig. 2 fuses the complete
sets of C matrices for two adjacent span-K intervals to produce a
complete set of C matrices on the combined span of size 2K. Of
the 4K-1 output C matrices, 2K are obtained from the 2(2K-1)
inputs without any processing. The other 2K-1 output C matrices
are obtained by 2K-1 C Fusion Modules, or CFMs, which implement
the $\otimes_C$ operator.

The basic span-K to span-2K FM array shown in Fig. 2 can be
utilized to compute the C matrices on the entire interval in lg N
stages. This is illustrated in Fig. 3 for the special case of N =
16. Note that, indexing the stages from left to right (i.e.,
increasing span) as i = 1, 2, ..., n = lg N, it is clear that
there are $2^{n-i}$ FM arrays in stage i.
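The span-K to span-2K FM array and its repeated use over lg N stages
can be sketched as follows. This is a software illustration only,
not the hardware of Figs. 2 and 3: it assumes N is a power of two,
stores a complete set as a dictionary with hypothetical keys "pre"
(the matrices C(k0, k0+m), m = 1..K) and "suf" (the matrices
C(k0+m, k0+K), m = 0..K-1), and runs the FM arrays of a stage in a
loop where hardware would run them in parallel.

```python
def c_fusion(left, right):
    """Min-sum fusion: out[i][j] = min over m of left[i][m] + right[m][j]."""
    size = len(left)
    return [[min(left[i][m] + right[m][j] for m in range(size))
             for j in range(size)]
            for i in range(size)]


def fm_array(left, right):
    """Fuse two adjacent span-K complete sets into one span-2K complete set.

    Returns the new set and the number of C fusions used (2K - 1).
    """
    full_left = left["pre"][-1]    # C(k0, k0+K)
    full_right = right["suf"][0]   # C(k0+K, k0+2K)
    full = c_fusion(full_left, full_right)
    # Left-edge matrices: K pass through, K-1 are fused, plus the full span.
    pre = (left["pre"]
           + [c_fusion(full_left, p) for p in right["pre"][:-1]]
           + [full])
    # Right-edge matrices: the full span, K-1 fused, K pass through.
    suf = ([full]
           + [c_fusion(s, full_right) for s in left["suf"][1:]]
           + right["suf"])
    fusions = 1 + 2 * (len(right["pre"]) - 1)
    return {"pre": pre, "suf": suf}, fusions


def tree_complete_set(steps):
    """Apply FM arrays stage by stage; len(steps) must be a power of two."""
    sets = [{"pre": [c], "suf": [c]} for c in steps]
    total_fusions = 0
    while len(sets) > 1:
        merged = []
        for a, b in zip(sets[0::2], sets[1::2]):
            out, used = fm_array(a, b)
            merged.append(out)
            total_fusions += used
        sets = merged
    return sets[0], total_fusions
```

For N = 16 one-step matrices this sketch performs 49 fusions in
total when every FM is a CFM, and each pass of the while loop
corresponds to one stage of the tree.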

Because the final objective is to compute the causal and
anticausal state metrics, however, not all FMs need be CFMs for
all FM arrays. Specifically, the forward state metrics $f_{k-1}$
can be obtained from $f_{m-1}$ and C(m, k) via

$$f_{k-1}(s_k) = \min_{s_m} \left[ f_{m-1}(s_m) + C(s_m, s_k) \right] \quad (16)$$

Similarly, the backward state metrics can be updated via

$$b_k(s_k) = \min_{s_m} \left[ b_m(s_m) + C(s_k, s_m) \right] \quad (17)$$

A processing module that produces an f vector from another f
vector and a C matrix, as described in equation (16), is referred
to as an f Fusion Module (fFM). A b Fusion Module (bFM) is
defined analogously according to the operation in equation (17).
Fig. 3 indicates which FMs may be implemented as fFMs or bFMs.
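Equations (16) and (17) can be sketched in the same illustrative
style. The function names and the 2-state values are hypothetical;
vectors are indexed by state, and the C matrix is indexed as
C[earlier state][later state].

```python
def f_fusion(f_prev, c):
    """Equation (16): f_out[sk] = min over sm of f_prev[sm] + c[sm][sk]."""
    size = len(f_prev)
    return [min(f_prev[sm] + c[sm][sk] for sm in range(size))
            for sk in range(size)]


def b_fusion(b_next, c):
    """Equation (17): b_out[sk] = min over sm of b_next[sm] + c[sk][sm]."""
    size = len(b_next)
    return [min(b_next[sm] + c[sk][sm] for sm in range(size))
            for sk in range(size)]


def c_fusion(left, right):
    """Min-sum fusion of two C matrices, for comparison with the above."""
    size = len(left)
    return [[min(left[i][m] + right[m][j] for m in range(size))
             for j in range(size)]
            for i in range(size)]


# Hypothetical 2-state example.
A = [[0, 3], [2, 1]]
B = [[1, 4], [0, 2]]
f0 = [0, 5]
```

Each call produces S outputs with one S-way minimization apiece,
whereas a C fusion produces S x S outputs; applying f_fusion twice
through A and then B gives the same vector as applying it once to
the fused matrix.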
The importance of this development is that the calculation
of the state metrics has O(lg N) latency. This is because the only
data dependencies are from one stage to the next and thus all FM
arrays within a stage and all FMs within an FM array can be
executed in parallel, each taking O(1) latency. The cost of this
low latency is the need for relatively large amounts of area. One
mitigating factor is that, because the stages of the tree operate
in sequence, hardware can be shared between stages. Thus, the
stage that requires the most hardware dictates the total hardware
needed. A rough estimate of this is N CFMs, each of which
involves $S^2$ S-way ACS units with the associated registers. A
more detailed analysis that accounts for the use of bFMs and fFMs
whenever possible is set out below in the section headed
"Hardware Resource Requirements". For the example in Fig. 3 and a
4-state FSM (i.e., S = 4), stage 2 has the most CFMs (8) and also
the most processing complexity. In particular, the complexity of
stages i = 1, 2, 3, 4 measured in terms of 4 4-way ACS units is
26, 36, 32, and 16, respectively. Thus, if hardware is shared
between stages, a total of 36 sets of 4 4-way ACS units is
required to execute all FMs in a given stage in parallel. For
applications in which this number of ACS units is prohibitive,
one can reduce the hardware requirements by as much as a factor
of S with a corresponding linear increase in latency.
The implementation of the completion operation defined in
equation (8) should also be considered. The basic operation
required is a Q-way ACS unit where Q is the number of transitions
consistent with $u_k$. Assuming that at most half of the
transitions will be consistent with $u_k$, Q is upper bounded by
$S^2/2$. Consequently, when S is large, low-latency,
area-efficient implementations of the completion step may become
an important issue. Fortunately, numerous low-latency
implementations are well-known (e.g., such as described in P. J.
Black, Algorithms and Architectures for High Speed Viterbi
Decoding. PhD thesis, Stanford University, California, March
1993). One straightforward implementation may be one that uses a
binary tree of comparators and has latency of $O(\lg S^2)$. For
small S, this additional latency generally is not significant.
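As a simple illustration of the binary tree of comparators
mentioned above, the following sketch reduces Q candidate metrics
pairwise and reports how many comparator levels were needed. The
function name is hypothetical and the sketch models only the
compare-select part of a Q-way ACS.

```python
def tree_min(values):
    """Pairwise (binary-tree) minimum.

    Returns (minimum, number of comparator levels used).
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [min(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes to the next level
        level = nxt
        depth += 1
    return level[0], depth
```

For S = 4 and Q = 8 candidates the tree has three levels, which is
lg of 8.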

The computational complexity of the state metric
calculations can be computed using simple expressions based on
Figs. 2 and 3. As described below in the section headed
"Computation Complexity Analysis", the total number of
computations, measured in units of S S-way ACS computations, is

$$N_{S,S} = N((\lg N - 3)S + 2) + 4S - 2 \quad (18)$$

For the example in Fig. 3 and a 4-state FSM, an equivalent
of 110 sets of 4 4-way ACS operations are performed. This is to
be compared with the corresponding forward-backward algorithm
which would perform 2N = 32 such operations and have baseline
architectures with four times the latency. In general, note that
for a reduction in latency from N to lg N, the computation is
increased by a factor of roughly (1/2)(lg N - 3)S + 1. Thus, while
the associated complexity is high, the complexity scaling is
sub-linear in N.
Optimizations for Sparse Trellises

In general, the above architecture is most efficient for
fully-connected trellises. For sparser trellis structures,
however, the initial processing modules must process C-matrices
containing elements set to ∞, accounting for MSMs of pairs of
states between which there is no sequence of transitions, thereby
wasting processing power and latency. Alternative embodiments may
incorporate optimizations that address this inefficiency.

Consider as a baseline a standard 1-step trellis with $S = M^L$
states and exactly M transitions into and out of each state, in
which there exists exactly one sequence of transitions to go
from a given state $s_k$ to a given state $s_{k+L}$. One
optimization is to pre-collapse the one-step trellis into an
R-step trellis, 1 ≤ R ≤ L, and apply the tree-SISO architecture to
the collapsed trellis. A second optimization is to, wherever
possible, simplify the C fusion modules. In particular, for a
SISO on an R-step trellis, the first lg(L/R) stages can be
simplified to banks of additions that simply add incoming pairs
of multi-step transition metrics.
More precisely, pre-collapsing involves adding the R metrics
of the 1-step transitions that constitute the transition metrics
of each super-transition, for k = 0, 1, ..., (N-1)/R. The SISO
accepts these inputs and produces forward and backward MSMs,
$f_{kR-1}(s_{kR})$ and $b_{(k+1)R}(s_{(k+1)R})$, for k = 0, 1, ...,
N/R. One key benefit of pre-collapsing is that the number of SISO
inputs is reduced by a factor of R, thereby reducing the number
of stages required in the state metric computation by lg R. One
potential disadvantage of pre-collapsing is that the desired
soft-outputs must be computed using a more complex, generalized
completion operation.
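Namely, writing $T_{kR}$ for the super-transition from $s_{kR}$ to
$s_{(k+1)R}$ and $M(T_{kR})$ for its metric,

$$\mathrm{SO}(u_{kR+m}) = \min_{T_{kR}:\,u_{kR+m}} \left[ f_{kR-1}(s_{kR}) + M(T_{kR}) + b_{(k+1)R}(s_{(k+1)R}) \right] - \mathrm{SI}(u_{kR+m}), \quad m = 0, 1, \ldots, R-1 \quad (19)$$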

One principal issue is that for each $u_{kR+m}$ this completion
step involves an $(M^{L+R}/2)$-way ACS rather than the
$(M^{L+1}/2)$-way ACS required for the 1-step trellis.

In order to identify the optimal R (i.e., for minimum
latency) assuming both these optimizations are performed, the
relative latencies of the constituent operations are needed.
While exact latencies are dependent on implementation details,
rough estimates may still yield insightful results. In
particular, assume that both the pre-collapsing additions and
the ACS operations for the state metric and completion operations
are implemented using binary trees of adders/comparators, so that
their estimated delay is logarithmic in the number of their
inputs. One important observation is that the pre-collapsing
along with the lg(L/R) simplified stages together add L 1-step
transition metrics (producing the transition metrics for a
fully-connected L-step trellis) and thus can jointly be
implemented in an estimated lg L time units. In addition, the
state metric $(M^L)$-way ACS units take $\lg(M^L)$ time units and
the completion units, $(M^{L+R}/2)$-way ACSs, take
$\lg(M^{L+R}/2)$ time units. Assuming maximal parallelism, this
yields a total latency of

$$\lg L + \lg(N/R)\,\lg(M^L) + \lg(M^{L+R}/2) \quad (20)$$

The only part of (20) that depends on R is (lg M)(R - L lg R). It
follows that the minimum latency occurs when R - L lg R is
minimum (subject to 1 ≤ R ≤ L), which occurs when R = L. This
suggests that the minimum-latency architecture is one in which
the trellis is pre-collapsed into a fully-connected trellis and
more complex completion units are used to extract the soft
outputs from the periodic state metrics calculated.
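The latency estimate of equation (20) can be evaluated numerically
to confirm the choice R = L. The following sketch is illustrative
only; the function names and the parameter values (N = 1024, M = 2,
L = 4) are assumptions chosen for the example, and R is not
required to divide N because (20) is itself only an estimate.

```python
from math import log2


def total_latency(N, M, L, R):
    """Equation (20): estimated latency in time units."""
    return (log2(L)
            + log2(N / R) * log2(M ** L)
            + log2(M ** (L + R) / 2))


def best_R(N, M, L):
    """Return the R in 1..L that minimizes equation (20)."""
    return min(range(1, L + 1), key=lambda R: total_latency(N, M, L, R))
```

For N = 1024, M = 2 and L = 4, the estimate is 46 time units at
R = 1, 43 at R = 2, and 41 at R = 4, so the minimum is at R = L.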
The cost of this reduced latency is the additional area
required to implement the trees of adders that produce the L-step
transition metrics and the larger trees of comparators required
to implement the more complex completion operations. Note,
however, that this area overhead can be mitigated by sharing
adders and comparators among stages of each tree and, in some
cases, between trees with only marginal impact on latency.
Use in Tiled Sub-Window Schemes

The present inventors recognized that using smaller
combining windows reduces latency and improves throughput of
computing the soft-inverse. Minimum half-window (MHW) algorithms
are a class of SISOs in which the combining window edges $K_1$
and $K_2$ satisfy $K_1 \le \max(0, k-d)$ and
$K_2 \ge \min(N, k+d)$, for k = 0, ..., N-1; i.e., for every point
k away from the edge of the observation window, the soft-output
is based on a sub-window with left and right edges at least d
points from k.

The traditional forward-backward algorithm can be used on
sub-windows to obtain a MHW-SISO. One particular scheme is the
tiled sub-window technique in which combining windows of length
2d + h are used to derive all state metrics. In this scheme, as
illustrated in Fig. 4, the windows are tiled with overlap of
length 2d and there are (N - 2d)/h such windows. The
forward-backward recursion on each interior sub-window yields h
soft outputs, so there is an overlap penalty which increases as h
decreases.

For the i-th such window, the forward and backward state
metrics are computed using the recursions, modified from that of
equations (6) and (7):



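(21): $f^{(i)}_k(s_{k+1}) = \mathrm{MSM}(s_{k+1})$ [the window indices on the MSM are illegible in the source]

(22): [backward recursion for the i-th window; illegible in the source]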
If all windows are processed in parallel, this architecture
yields a latency of O(d + h).

The tree-SISO algorithm can be used in a MHW scheme without
any overlap penalty and with O(lg d) latency. Consider N/d
combining windows of size d and let the tree-SISO compute
C(id, id + j) and C((i + 1)d, (i + 1)d - j) for j = 0, ..., d - 1
and i = 0, ..., N/d - 1. Then, use one additional stage of logic
to compute the forward and backward state metrics for all k time
indices that fall within the i-th window, i = 0, ..., N/d - 1, as
follows (this should be interpreted with $C(s_{-d}, s_0)$ replaced
by initial left-edge information and similarly for
$C(s_{N-1}, s_{N+d-1})$):


The inner minimization corresponds to a conversion from C
information to f (b) information as in equations (13) and (14).
The outer minimization corresponds to an fFM or bFM. The order of
this minimization was chosen to minimize complexity. This is
reflected in the example of this approach, shown in Fig. 5, where
the last stage of each of the four tree-SISOs is modified to
execute the above minimizations in the proposed order. The module
that does this is referred to as a 2Cfb module. This module may
be viewed as a specialization of the stage 2 center CFMs in Fig.
3. The above combining of sub-window tree-SISO outputs adds one
additional processing stage so that the required number of stages
of FMs is lg(d) + 1.

Computational Complexity Comparison 

The computational complexity of computing the state metrics 
using the forward-backward tiled scheme is the number of windows 
times the complexity of computing the forward-backward algorithm 
on each window. In terms of S S-way ACSs, this can be 
approximated for large N via 

$$\frac{N - 2d}{h}\,2(d + h) \approx \frac{2N}{h}(d + h) \quad (23)$$

The computational complexity of computing the state metrics using 
the tree-SISO tiled scheme in terms of S S-way ACSs can be 
developed similarly and is 

$$\frac{N}{d}\,d\,\lg(d)\,S + 2N = N(S\lg(d) + 2) \quad (24)$$

Determining which scheme has higher computational complexity
depends on the relative sizes of h and d. If h is reduced, the
standard forward-backward scheme reduces in latency but increases
in computational complexity because the number of overlapped
windows increases. Since the tiled tree-SISO architecture has no
overlap penalty, as h is decreased in a tiled forward-backward
scheme, the relative computational complexity trade-off becomes
more favorable for the tree-SISO approach. In fact, for
h < 2d/(S lg d), the computational complexity of the tree-SISO is
lower than that of the tiled forward-backward scheme.
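The comparison can be made concrete with a short calculation. The
sketch below is illustrative only: it assumes that equation (23)
reads (2N/h)(d + h) in its large-N form and that equation (24)
reads N(S lg d + 2), and the function names and example values
(N = 1024, d = 16, S = 4) are chosen for the example.

```python
from math import log2


def fb_tiled_ops(N, d, h):
    """Large-N form of equation (23), in units of S S-way ACSs."""
    return 2 * N * (d + h) / h


def tree_tiled_ops(N, d, S):
    """Equation (24), in units of S S-way ACSs."""
    return N * (S * log2(d) + 2)


def crossover_h(d, S):
    """Value of h at which the two estimates above are equal."""
    return 2 * d / (S * log2(d))
```

With N = 1024, d = 16 and S = 4, the tree-SISO estimate is 18432
units. The tiled forward-backward estimate is 34816 units at h = 1,
18432 at h = 2, and 10240 at h = 4, so the two estimates cross at
h = 2 under these assumptions.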

A Design Example: 4-state PCCC 

The highly parallel architectures considered require large
implementation area. In this example, an embodiment is described
in which the area requirements are most feasible for
implementation in the near future. Specifically, an iterative
decoder based on 4-state sparse (one-step) trellises is
considered. Considering larger S will yield more impressive
latency reductions for the tree-SISO. This is because the
latency reduction obtained by the tree-SISO architecture relative
to the parallel tiled forward-backward architecture depends on
the minimum half-window size. One expects that good performance
requires a value of d that grows with the number of states (i.e.,
similar to the rule-of-thumb for traceback depth in the Viterbi
algorithm for sparse trellises). In contrast, considering
pre-collapsing will yield less impressive latency reductions. For
example, if d = 16 is required for a single-step trellis, then an
effective value of d = 8 would suffice for a two-step trellis.
The latency reduction factor associated with the tree-SISO for
the former would be approximately 4, but only 8/3 for the latter.
However, larger S and/or pre-collapsing yields larger
implementation area and is not in keeping with our desire to
realistically assess the near-term feasibility of these
algorithms.

In particular, a standard parallel concatenated
convolutional code (PCCC) with two 4-state constituent codes [1],
[2] is considered. Each of the recursive systematic constituent
codes generates parity using the generator polynomial
$G(D) = (1 + D^2)/(1 + D + D^2)$ with parity bits punctured to
achieve an overall systematic code with rate 1/2.

In order to determine the appropriate value for d to be used
in the MHW-SISOs, simulations were performed where each SISO used
a combining window {k - d, ..., k + d} to compute the soft-output
at time k. This is exactly equivalent to the SISO operation
obtained by a tiled forward-backward approach with h = 1. Note
that, since d is the size of all (interior) half-windows for the
simulations, any architecture based on a MHW-SISO with d will
perform at least as well (e.g., h = 2 tiled forward-backward,
d-tiled tree-SISO, etc.). Simulation results are shown in Fig. 6
for an interleaver size of N = 1024 with min-sum marginalization
and combining and ten iterations. The performance is shown for
various d along with the performance of the fixed-interval
(N = 1024) SISO. No significant iteration gain is achieved beyond
ten iterations for any of the configurations. The results
indicate that d = 16 yields performance near the fixed-interval
case. This is consistent with the rule-of-thumb of five to seven
times the memory for the traceback depth in a Viterbi decoder
(i.e., roughly d = 7 × 2 = 14 is expected to be sufficient).

Since the required window size is d = 16, the latency
improvement of a tree-SISO relative to a tiled forward-backward
scheme is close to 4 = 16/lg(16). The computational complexity of
these two approaches is similar and depends on the details of the
implementation and the choice of h for the tiled forward-backward
approach. A completely fair comparison generally would require a
detailed implementation of the two approaches. Below, a design
for the tree-SISO based sub-window architecture is described.
A factor that impacts the area of the architecture is the
bit-width of the data units. Simulation results suggest that an
8-bit datapath is sufficient. Roughly speaking, a tree-based
architecture for this example would require 1024 sets of sixteen
4-way ACS units along with associated output registers to store
intermediate state metric results. Each 4-way ACS unit can be
implemented with an 8-bit 4-to-1 multiplexor, four 8-bit adders,
six 8-bit comparators, and one 8-bit register. Initial VLSI
designs indicate that these units require approximately 2250
transistors. Thus, this yields an estimate of
16 × 2250 × 1024 ≈ 40 million transistors. This number of logic
transistors pushes the limit of current VLSI technology but
should soon be feasible. An architecture is considered in which
one clock cycle is used per stage of the tree at a 200 MHz clock
frequency. For d = 16, each SISO operation can be performed in 6
such clock cycles (using one clock for the completion step).
Moreover, assume a hard-wired interleaver comprising two rows of
1024 registers with an interconnection network. Such an
interleaver would be larger than existing memory-based solutions,
but could have a latency of 1 clock cycle. Consequently, one
iteration of the turbo decoder, consisting of two applications of
the SISO, one interleaving, and one deinterleaving, requires 14
clock cycles. Assuming ten iterations, the decoding of 1024 bits
would take 140 clock cycles, or a latency of just 700 ns.

This latency also implies a very high throughput which can
further be improved with standard pipelining techniques. In
particular, a non-pipelined implementation has an estimated
throughput of 1024 bits per 700 ns = 1.5 Gb/second. Using the
tree-SISO architecture one could also pipeline across interleaver
blocks as described by Masera et al., "VLSI architectures for
turbo codes," IEEE Transactions on VLSI, vol. 7, September 1999.
In particular, 20 such tiled tree-SISOs and associated
interleavers can be used to achieve a factor of 20 in increased
throughput, yielding a throughput of 30 Gb/second.

Moreover, unlike architectures based on the forward-backward
algorithm, the tree-SISO can easily be internally pipelined,
yielding even higher throughputs with linear hardware scaling. In
particular, if dedicated hardware is used for each stage of the
tree-SISO, pipelining the tree-SISO internally may yield another
factor of lg(d) in throughput, with no increase in latency. For
window sizes of d = 16, the tree-based architecture could support
over 120 Gb/second. That said, it is important to realize that
with current technology such hardware costs may be beyond
practical limits. Given the continued increasing densities of
VLSI technology, however, even such aggressive architectures may
become cost-effective in the future.
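The arithmetic behind the figures in this design example can be
reproduced with a few lines of Python. The sketch is illustrative
only; the function names and default arguments are hypothetical
and simply restate the values given above (d = 16, 200 MHz clock,
ten iterations, one-cycle interleaver, 2250 transistors per 4-way
ACS unit).

```python
from math import log2


def turbo_decoder_estimate(block_bits=1024, d=16, clock_hz=200e6,
                           iterations=10, interleaver_cycles=1):
    """Return (SISO cycles, total cycles, latency in s, throughput in b/s)."""
    fm_stages = int(log2(d)) + 1          # lg(d) + 1 stages of FMs
    siso_cycles = fm_stages + 1           # plus one cycle for completion
    # Two SISO passes, one interleaving and one deinterleaving per iteration.
    cycles_per_iteration = 2 * siso_cycles + 2 * interleaver_cycles
    total_cycles = iterations * cycles_per_iteration
    latency_s = total_cycles / clock_hz
    throughput_bps = block_bits / latency_s
    return siso_cycles, total_cycles, latency_s, throughput_bps


def transistor_estimate(sets=1024, acs_per_set=16, transistors_per_acs=2250):
    """Logic transistor estimate for the tree-based architecture."""
    return sets * acs_per_set * transistors_per_acs
```

With the default values this gives 6 cycles per SISO operation, 140
cycles and 700 ns per block, a non-pipelined throughput of about
1.46 Gb/second (quoted above as 1.5 Gb/second), and 36,864,000
transistors (quoted above as roughly 40 million).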

Computation Complexity Analysis 

The number of required stages is n = lg N, with $2^{n-i}$ FM
arrays in stage i. Each of these FM arrays in stage i spans $2^i$
steps in the trellis and contains $2^i - 1$ FMs. Thus, the total
number of FMs in stage i is $n_{FM}(i) = (2^i - 1)2^{n-i}$. The
total number of fusion operations is therefore

$$N_{FM} = \sum_{i=1}^{n} n_{FM}(i) = \sum_{i=1}^{n} 2^n - \sum_{i=1}^{n} 2^{n-i} = N(\lg N - 1) + 1 \quad (25)$$

For the example in Fig. 3, this reduces to $N_{FM} = 49$.
Using Fig. 3 as an example, it can be seen that the number
of FMs that can be implemented as fFMs in stage i is
$n_f(i) = 2^{i-1}$. In the special case of i = n, this is
interpreted as replacing the 2K-1 CFMs by K fFMs and K bFMs. For
example, in the fourth stage in Fig. 3, the 15 CFMs implied by
Fig. 2 may be replaced by 8 fFMs and 8 bFMs, as shown. The number
of FMs that can be implemented as bFMs is the same; i.e.,
$n_b(i) = n_f(i) = 2^{i-1}$. It follows that the number of CFMs
required at stage i is

$$n_C(i) = n_{FM}(i) - n_b(i) - n_f(i) + \delta(n - i) \quad (26)$$

$$= 2^n - 2^{n-i} - 2^i + \delta(n - i) \quad (27)$$

where δ(j) is the Kronecker delta. The total number of fusion
modules is therefore

$$N_f = N_b = \sum_{i=1}^{n} n_f(i) = \sum_{i=1}^{n} 2^{i-1} = N - 1 \quad (28)$$

$$N_C = \sum_{i=1}^{n} n_C(i) = N(n - 1) + 1 - \sum_{i=1}^{n} 2^i + 1 = N(\lg N - 3) + 4 \quad (29)$$

Comparing equations (25) and (29), it is seen that, for
relatively large N, the fraction of FMs that must be CFMs is
(lg N - 3)/(lg N - 1). For smaller N, the fraction is slightly
larger. For example, in Fig. 3, $N_f = N_b = 15$ and there are 20
CFMs.

The CFM is approximately S (i.e., the number of states)
times more complex than the fFM and bFM operations. This can be
seen by comparing equation (15) with equations (16) and (17).
Specifically, the operations in equations (15), (16), and (17)
involve S-way ACSs. For the CFM, an S-way ACS must be carried out
for every possible state pair $(s_{k_0}, s_{k_1})$ in (15); i.e.,
$S^2$ state pairs. The S-way ACS operations in equations (16) and
(17) need only be computed for each of the S states $s_k$. Thus,
taking the basic unit of computation to be S S-way ACS on an
S-state trellis, the total number of these computations required
for stage i is

$$n_{S,S}(i) = S\,n_C(i) + n_f(i) + n_b(i) \quad (30)$$

Summing over stages, the total number of computations is
obtained, measured in units of S S-way ACS computations:

$$N_{S,S} = S N_C + 2 N_f = N((\lg N - 3)S + 2) + 4S - 2 \quad (31)$$

For the example in Fig. 3 and a 4-state FSM, an equivalent of 110
sets of S S-way ACS operations are performed. This is to be
compared with the corresponding forward-backward algorithm which
would perform 2N = 32 such operations and have baseline
architectures with four times the latency. In general, note that
for a reduction in latency from N to lg N, the computation is
increased by a factor of roughly (1/2)(lg N - 3)S + 1. Thus,
while the associated complexity is high, the complexity scaling
is sub-linear in N. For small S, this is better than linear-scale
solutions to low-latency Viterbi algorithm implementations
(e.g., such as described in P. J. Black, Algorithms and
Architectures for High Speed Viterbi Decoding. PhD thesis,
Stanford University, California, March 1993; and Fettweis, et
al., "Parallel Viterbi algorithm implementation: Breaking the
ACS-bottleneck" IEEE Trans. Commun., vol. 37, pp. 785-790, August
1989).
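The per-stage counts of equations (25) through (31) can be checked
numerically. The following sketch is illustrative only; the
function and key names are hypothetical, and N is assumed to be a
power of two so that n = lg N is an integer.

```python
def stage_counts(n, S, i):
    """Module counts for stage i of an n-stage tree on an S-state FSM."""
    n_fm = (2 ** i - 1) * 2 ** (n - i)           # all FMs in stage i
    n_f = n_b = 2 ** (i - 1)                     # fFMs and bFMs
    n_c = n_fm - n_b - n_f + (1 if i == n else 0)  # equation (26)
    n_ss = S * n_c + n_f + n_b                   # equation (30)
    return {"FM": n_fm, "f": n_f, "b": n_b, "C": n_c, "SS": n_ss}


def totals(n, S):
    """Sum each count over stages i = 1..n."""
    stages = [stage_counts(n, S, i) for i in range(1, n + 1)]
    return {key: sum(stage[key] for stage in stages) for key in stages[0]}
```

For N = 16 (n = 4) and S = 4 the per-stage values of the "SS" count
are 26, 36, 32, and 16, and the totals are 49 FMs, 15 fFMs, 15
bFMs, 20 CFMs, and 110 units of S S-way ACS computation, in
agreement with the closed-form expressions.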



Hardware Resource Requirements 

The maximum of $n_{S,S}(i)$ over i is of interest because it
determines the minimum hardware resource requirements to achieve
the desired minimum latency. This is because the fusion modules
can be shared between stages with negligible impact on latency.

The maximum of $n_{S,S}(i)$ can be found by considering the
condition on i for which $n_{S,S}(i) \ge n_{S,S}(i-1)$.
Specifically, if i < n,

$$n_{S,S}(i) \ge n_{S,S}(i-1) \quad (32)$$

$$\iff 2^n \ge 2^{2i-1}(1 - S^{-1}) \quad (33)$$

$$\iff i \le \frac{n + 1 - \lg(1 - S^{-1})}{2} \quad (34)$$

It follows that $n_{S,S}(i)$ has no local maxima and

$$i^* = \left\lfloor \frac{n + 1 - \lg(1 - S^{-1})}{2} \right\rfloor \quad (35)$$

can be used to find the maximizer of $n_{S,S}(i)$. Specifically, if
equation (35) yields $i^* < n-1$, then the maximum occurs at $i^*$;
otherwise ($i^* = n-1$), the i = n-1 and i = n cases should be
compared to determine the maximum complexity stage. For S ≥ 4,
equation (35) can be reduced to

$$i^* = \left\lfloor \frac{n + 1}{2} \right\rfloor \quad (36)$$

since $0.5 < \frac{1 - \lg(1 - S^{-1})}{2} \le 0.71$ for S ≥ 4.
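The rule for locating the most demanding stage can be compared
against an exhaustive search. The sketch below is illustrative
only: it assumes that equation (35) takes the floor of the bound in
equation (34), uses the per-stage complexity of equation (30), and
the function names are hypothetical.

```python
from math import floor, log2


def n_ss(n, S, i):
    """Equation (30) with n_C(i) from equation (27)."""
    n_c = 2 ** n - 2 ** (n - i) - 2 ** i + (1 if i == n else 0)
    return S * n_c + 2 * 2 ** (i - 1)


def peak_stage(n, S):
    """Stage with the largest complexity, following the rule in the text."""
    i_star = floor((n + 1 - log2(1 - 1 / S)) / 2)
    if i_star < n - 1:
        return i_star
    # Otherwise compare the last two stages directly.
    return max((n - 1, n), key=lambda i: n_ss(n, S, i))
```

For n = 4 and S = 4 the rule selects stage 2, whose complexity of 36
units sets the hardware requirement when modules are shared between
stages.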



Other Embodiments of a Tree-Structured SISO

Generally, in implementing a tree-structured SISO, any of
various tree structures can be used that represent a trade-off
between latency, computational complexity and IC area according
to the system designer's preferences. Figs. 8A and 8B show an
embodiment referred to as a Forward-Backward Tree-SISO
(FBT-SISO). In the FBT-SISO, the tree structure recursion is
bi-directional: inward and outward. As a result, the FBT-SISO has
twice the latency of the Tree-SISO, but it also achieves a
significant decrease in computational complexity and uses less
area on an integrated circuit (IC) chip.

Details of message passing using the generalized
forward-backward schedule are shown in FIGS. 8A and 8B in min-sum
processing. The inward messages are shown in FIG. 8A.
Specifically, initially MI[tk] is set to uniform and the
algorithm begins by activating the first level of subsystems to
compute Mk[tk] from MI[x(tk)] and MI[a(tk)]. The messages passed
inward to the next level of the tree are MSMk[sk, sk+1], which is
simply Mk[tk] if there are no parallel transitions. This inward
message passing continues with the messages shown. When the two
messages on s4 reach V4-s, the outward propagation begins and
proceeds downward as shown in FIG. 8B. Again, all nodes at a
given level of the tree are activated before activating any of
the nodes at the next level. At the bottom level, the input
metric of (sk, sk+1) is MSM{k}c[sk, sk+1]; i.e., the sum of the
forward and backward state metrics in the standard
forward-backward algorithm. Thus, the final activation of the
nodes on the bottom level produces the desired extrinsic output
metrics.

This FBT-SISO has twice the latency of the Tree-SISO because
the messages must propagate both inward and outward. This modest
increase in latency is accompanied by a significant reduction in
computational complexity. Specifically, the FBT-SISO has
O(K) computational complexity and O(log2 K) latency. This is to
be compared to O(K log2 K) computational complexity and O(log2 K)
latency for the Tree-SISO and O(K) computational complexity and
O(K) latency for message passing on the standard trellis.
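The inward-then-outward schedule can be illustrated with a small
software sketch. This is not the message-passing graph of FIGS. 8A
and 8B; it is a simplified model, under the assumption that the
inward pass fuses C matrices up a binary tree and the outward pass
pushes forward and backward state-metric vectors back down. All
function names are hypothetical, and the recursion visits nodes one
at a time where hardware would activate a whole tree level at once.

```python
def c_fusion(a, b):
    """Min-sum fusion of two C matrices."""
    s = len(a)
    return [[min(a[i][m] + b[m][j] for m in range(s)) for j in range(s)]
            for i in range(s)]


def f_fusion(f, c):
    """Forward update: out[j] = min over i of f[i] + c[i][j]."""
    s = len(f)
    return [min(f[i] + c[i][j] for i in range(s)) for j in range(s)]


def b_fusion(b, c):
    """Backward update: out[i] = min over j of c[i][j] + b[j]."""
    s = len(b)
    return [min(c[i][j] + b[j] for j in range(s)) for i in range(s)]


def inward(steps, lo, hi, span):
    """Inward pass: fill span[(lo, hi)] = C(lo, hi) from the leaves up."""
    if hi - lo == 1:
        span[(lo, hi)] = steps[lo]
        return
    mid = (lo + hi) // 2
    inward(steps, lo, mid, span)
    inward(steps, mid, hi, span)
    span[(lo, hi)] = c_fusion(span[(lo, mid)], span[(mid, hi)])


def outward(lo, hi, span, fwd, bwd):
    """Outward pass: derive the metrics at each midpoint from the edges."""
    if hi - lo == 1:
        return
    mid = (lo + hi) // 2
    fwd[mid] = f_fusion(fwd[lo], span[(lo, mid)])
    bwd[mid] = b_fusion(bwd[hi], span[(mid, hi)])
    outward(lo, mid, span, fwd, bwd)
    outward(mid, hi, span, fwd, bwd)


def fbt_state_metrics(steps, f0, b_end):
    """Return forward and backward metrics at times 0..N."""
    n = len(steps)
    span = {}
    inward(steps, 0, n, span)
    fwd = [None] * (n + 1)
    bwd = [None] * (n + 1)
    fwd[0], bwd[n] = f0, b_end
    fwd[n] = f_fusion(f0, span[(0, n)])
    bwd[0] = b_fusion(b_end, span[(0, n)])
    outward(0, n, span, fwd, bwd)
    return fwd, bwd
```

In this model the inward pass uses N-1 matrix fusions and the
outward pass uses a number of vector updates proportional to N, with
about lg N levels in each direction.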
Other embodiments of a tree-structured SISO may be
advantageous depending on the designer's goals and constraints.
In general, a tree-structured SISO can be constructed that uses
virtually any type of tree structure. For example, advantageous
results may be obtained with a tree-structured SISO that uses one
or more of the tree structures described in R.P. Brent and H.T.
Kung, "A regular layout for parallel adders," IEEE Transactions
on Computers, C-31:260-264 (March 1982) (the "Brent-Kung tree");
A.E. Despain, "New Computer Architectures," Chapter 2: Notes on
computer architecture for high performance, Academic Press
(1984); T.H. Cormen, C. Leiserson, and R.L. Rivest, "Introduction
to Algorithms," The MIT Press (1990); and/or S.M. Aji and R.J.
McEliece, "The generalized distributive law," IEEE Trans. Inform.
Theory, 46(2):325-343 (March 2000).

Fig. 9 shows a representation of a Brent-Kung Tree-SISO in
which calculation of the branch metrics in a d=16 sub-window uses
a modified version of the Brent-Kung model. There are six
pipeline stages in the Brent-Kung Tree-SISO which are indicated
by the dotted lines. Nodes shaded gray represent a 2-way ACS
while the black nodes represent a 4-way ACS.

The Brent-Kung Tree-SISO is based on the Brent-Kung parallel
adder. It was created to solve the fan-out problem of the prefix
adders. Additionally it reduces the amount of hardware needed for
the calculation of all the metrics. The fusion operation is also
used here, but the amount of fusion operations needed is
significantly reduced. In the Brent-Kung adder, instead of
additions, fusion operations are performed at each node. In this
form only the forward information can be calculated. The model
was modified so that the calculation of both the forward and
backward metrics could be performed at the same time.
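For reference, the unmodified Brent-Kung prefix schedule can be
sketched for any associative operator. This illustrates only the
forward (prefix) form mentioned above, not the modified network of
Fig. 9; the function name is hypothetical and the input length is
assumed to be a power of two.

```python
def brent_kung_prefix(items, op):
    """Inclusive prefix results using the Brent-Kung up/down sweeps.

    op must be associative and is always applied as op(earlier, later).
    Returns (list of prefixes, number of op applications).
    """
    x = list(items)
    n = len(x)
    ops = 0
    # Up-sweep: combine blocks of size d into blocks of size 2d.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
            ops += 1
        d *= 2
    # Down-sweep: fill in the prefixes skipped by the up-sweep.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
            ops += 1
        d //= 2
    return x, ops
```

With string concatenation as the operator, the eight inputs "a"
through "h" yield "a", "ab", ..., "abcdefgh" using 11 operations;
for 16 inputs the schedule uses 26 operations. Passing a min-sum
matrix fusion function as `op` gives the forward C matrices in the
same way.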

This model firstly reduces the number of units that have to
be driven by the output of an ACS. In the worst case here, 4
units feed on the output of a single stage (i.e., node C(0,8)).
This is almost half of what the previous approach needed. It is
also observed that a significant number of operations need only
2-way ACS units. Considering that the 2-way ACS unit uses less
than half of the hardware needed in a 4-way ACS unit, the total
size of the circuitry is expected to drop considerably. An
analysis similar to the one performed before verifies this
intuitive assumption. The results are summarized in Fig. 10.

Advantages presented by this approach may be counterbalanced
by a reduction in the system's performance. A close look at the
block diagram in Fig. 10 indicates that the longest path for the
calculations is now 6 cycles long. Adding the three extra cycles
needed for the forward and backward metric calculation, the
completion step and the interleaving process brings the total
delay of the SISO module to 9 cycles. This means that the
whole decoding process takes 180 cycles. Assuming that the same
cells are used for implementation, this yields a total time of
540 nsec for processing one block of 1024 bits of data. The
throughput of the system in this case will be 1.896 Gbps. If the
unfolded design pipelined after each ACS step is used, that would
allow one block to be processed after each step, and a throughput
of 16.82 Gbps could be realized. If 20 SISOs were used to unfold
the loop, the throughput would be 341.32 Gbps. For the first case,
which is of interest due to the size of the design, the
performance degradation is only about 24.1%. At the same time the
number of transistors needed to implement this design is reduced
by almost 43.8%.
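
The folded and fully unfolded figures quoted above can be checked
with a few lines of arithmetic. The sketch below assumes a 3 nsec
cycle (540 nsec / 180 cycles) and 20 SISO activations per block
(180 cycles / 9 cycles per activation). Both values are inferred
from the numbers in the text rather than stated directly, and the
constant names are chosen for the example.

```python
BLOCK_BITS = 1024
CYCLES_PER_SISO = 9
SISO_ACTIVATIONS = 20   # assumed: 180 cycles / 9 cycles per activation
CYCLE_NS = 3.0          # assumed: 540 nsec / 180 cycles

total_cycles = CYCLES_PER_SISO * SISO_ACTIVATIONS
block_time_ns = total_cycles * CYCLE_NS

# Bits per nanosecond is numerically equal to Gbps.
folded_gbps = BLOCK_BITS / block_time_ns
# Fully unfolded and pipelined: one block completes every cycle.
unfolded_gbps = BLOCK_BITS / CYCLE_NS

print(total_cycles, block_time_ns)
print(round(folded_gbps, 3), round(unfolded_gbps, 2))
```

Under these assumptions the sketch reproduces the 180 cycles,
540 nsec and roughly 1.896 Gbps of the folded design, and gives
about 341.3 Gbps for the design unfolded over 20 SISOs. The
16.82 Gbps figure depends on pipeline details that are not
restated here and is not reproduced.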

Decoding Block Codes and Concatenated Blocks with the
Tree-Structure SISO

There are other relevant decoding problems in which the
conventional solution is to run the standard forward-backward
algorithm on a trellis. The methods and techniques described
herein also can be applied in place of the forward-backward
algorithm in order to reduce the processing latency in these
applications. One example of such an application is computing
the soft inversion of a block error correction code using the
trellis representation of the code. A method for decoding a
block code using a trellis representation was described in Robert
J. McEliece, "On the BCJR Trellis for Linear Block Codes," IEEE
Transactions on Information Theory, vol. 42, no. 4, July 1996.
In that paper, a method for constructing a trellis representation
based on the parity-check structure of the code is described.
Decoding, or more generally soft-inversion of the block code, can
be accomplished by running the forward-backward algorithm on this
trellis. The methods and techniques described herein can be
applied to determine the soft-inversion of a block code in a
manner similar to the tree-structured methods for soft-inversion
of an FSM. Specifically, the forward and backward state metrics
can be computed to find the soft-inversion of a block code by
using a parallel prefix computational architecture with a tree
structure.

The application of the tree-SISO architecture to
implementing the soft-inverse of a block code via the parity
check structure is significant as a practical matter because
several attractive turbo-like codes are based on the
concatenation of simple parity check codes. These codes are
often referred to as "turbo-like codes" since they have
performance similar to turbo codes and are decoded using
iterative algorithms in which soft-inverse nodes accept, update
and exchange soft-information on the coded bits. Specific
examples of turbo-like codes include Low-Density Parity Check
Codes (LDPCs), also known as Gallager codes, and product codes
(also known as turbo-block codes), which are described in J.
Hagenauer, E. Offer, and L. Papke, "Iterative decoding of binary
block and convolutional codes," IEEE Transactions on Information
Theory, 42(2):429-445, March 1996, and R.G. Gallager,
"Low-Density Parity-Check Codes," MIT Press, Cambridge, MA, 1963.

As a specific example, the parity check trellis for a
standard Hamming (7,4) code is shown in Fig. 11. The
forward-backward algorithm may be run on this trellis with
latency on the order of 7. Alternatively, a tree-SISO
architecture can be used, resulting in the computational
structure shown in Fig. 12. As in the case of FSM soft-inversion,
the tree-structured computation yields the same forward and
backward state metrics as the conventional solution method (i.e.,
running the standard forward-backward algorithm on a trellis),
but has logarithmic latency (roughly 1/2 the latency in this
simple example). In general, for a block code with length N, the
conventional approach based on the forward-backward algorithm
will have latency O(N), whereas the same computation can be
performed using the tree-SISO with latency O(lg N).
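
The equivalence described above can be illustrated with a small
Python sketch. It makes several assumptions that are not taken
from the figures: a parity-check matrix whose k-th column is the
binary form of k + 1 (Fig. 11 may use a different but equivalent
matrix), min-sum processing with integer soft-in costs, and a
fusion operator modeled as a min-plus matrix product. All function
names are hypothetical. The sketch fuses the seven trellis
sections pairwise and compares the result with the conventional
forward recursion and with a brute-force search over codewords.

```python
from itertools import product

INF = float("inf")
S = 8  # syndrome states for a code with three parity checks

# Assumed parity-check matrix: column k is the 3-bit binary form of k + 1.
H_COLS = [1, 2, 3, 4, 5, 6, 7]


def branch_matrix(k, metric):
    """Min-sum transition matrix of trellis section k.

    metric = (cost of bit 0, cost of bit 1). A bit b moves the partial
    syndrome s to s ^ (b * H_COLS[k]); all other transitions cost INF.
    """
    m = [[INF] * S for _ in range(S)]
    for s in range(S):
        for b in (0, 1):
            m[s][s ^ (b * H_COLS[k])] = metric[b]
    return m


def fuse(a, b):
    """Min-plus product: best cost between every start and end state."""
    return [[min(a[i][j] + b[j][l] for j in range(S)) for l in range(S)]
            for i in range(S)]


def tree_fuse(mats):
    """Fuse adjacent sections pairwise, level by level."""
    levels = 0
    while len(mats) > 1:
        nxt = [fuse(mats[i], mats[i + 1]) for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:
            nxt.append(mats[-1])  # odd section passes through unchanged
        mats, levels = nxt, levels + 1
    return mats[0], levels


def forward_sequential(mats):
    """Conventional forward recursion starting in the zero-syndrome state."""
    f = [0] + [INF] * (S - 1)
    for m in mats:
        f = [min(f[i] + m[i][j] for i in range(S)) for j in range(S)]
    return f


def brute_force_min(metrics):
    """Minimum total cost over all words with zero syndrome (codewords)."""
    best = INF
    for word in product((0, 1), repeat=len(metrics)):
        syndrome = 0
        for k, b in enumerate(word):
            syndrome ^= b * H_COLS[k]
        if syndrome == 0:
            best = min(best, sum(metrics[k][b] for k, b in enumerate(word)))
    return best


# Illustrative integer soft-in costs (cost of 0, cost of 1) per coded bit.
metrics = [(0, 4), (3, 0), (0, 2), (1, 0), (0, 5), (2, 0), (0, 1)]
mats = [branch_matrix(k, m) for k, m in enumerate(metrics)]
fused, levels = tree_fuse(mats)
```

Row 0 of `fused` equals the end-of-block forward metrics from
`forward_sequential(mats)`, `fused[0][0]` equals
`brute_force_min(metrics)`, and the seven sections are fused in
three levels rather than seven sequential steps. The sketch
computes only the full-span result; a complete tree-SISO would
also retain the intermediate prefix and suffix results needed for
the soft-outputs on each bit.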

As a specific example of the use of a tree-structured
parity check node SISO such as the one illustrated in Fig. 12,
consider the iterative decoder for an LDPC, as illustrated in
Fig. 13. This decoder has two basic types of soft-inverse
processing nodes: the soft-broadcaster nodes and the parity-check
SISO nodes. In one embodiment, the decoder executes by
activating the soft-broadcaster nodes in parallel, followed by
permuting the generated soft-information, which becomes the input
to the parity check SISO nodes. These nodes are activated,
producing updated beliefs on the coded bit values that are passed
back through the permutation and become the input to the
soft-broadcasters. This may be viewed as a single iteration.
Several iterations are required before the decoding procedure is
complete. As in turbo decoding, the stopping criterion is part
of the design, with a common choice being a fixed number of
iterations. The computation shown in Fig. 11 is performed in
each parity check SISO node in a preferred embodiment.
Alternatively, the tree-structured architecture can be used for
the computation at each parity check SISO node. In the most
common form, the parity check nodes in LDPCs are single-parity
checks, and therefore have a two-state trellis representation.
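
As a minimal sketch of a single-parity-check SISO node on its
two-state trellis, the following Python function runs a
conventional forward-backward recursion and returns extrinsic
soft-outputs for each bit. It assumes min-sum processing, an even
parity constraint, and soft-information expressed as a pair of
costs per bit. These choices and the name `spc_siso` are
illustrative and are not taken from Fig. 13.

```python
INF = float("inf")


def spc_siso(metrics):
    """Min-sum extrinsic soft-outputs for one even-parity check.

    metrics[k] = (cost of bit k being 0, cost of bit k being 1).
    The trellis state is the running parity; paths start and end in 0.
    Returns, per bit, (cost0, cost1) excluding that bit's own soft-in.
    """
    n = len(metrics)

    # Forward metrics: fwd[k][s] = best cost of bits 0..k-1 with parity s.
    fwd = [[0, INF]]
    for c0, c1 in metrics:
        a0, a1 = fwd[-1]
        fwd.append([min(a0 + c0, a1 + c1), min(a0 + c1, a1 + c0)])

    # Backward metrics: bwd[k][s] = best cost of bits k..n-1 with parity s.
    bwd = [[0, INF]]
    for c0, c1 in reversed(metrics):
        b0, b1 = bwd[-1]
        bwd.append([min(b0 + c0, b1 + c1), min(b0 + c1, b1 + c0)])
    bwd.reverse()

    out = []
    for k in range(n):
        a, b = fwd[k], bwd[k + 1]
        # Bit 0 keeps the parity state; bit 1 flips it.
        out.append((min(a[0] + b[0], a[1] + b[1]),
                    min(a[0] + b[1], a[1] + b[0])))
    return out
```

For example, `spc_siso([(0, 2), (0, 3), (1, 0)])` returns
`[(1, 0), (1, 0), (0, 2)]`: the first two bits are each pushed
toward 1 because the third bit favors 1 and the overall parity
must be even. The two recursions are sequential in the number of
bits, which is the part a tree-structured architecture would
replace.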

Conclusion

Based on the interpretation of the SISO operation in terms
of parallel prefix/suffix operations, a family of tree-structured
architectures is described above. Compared to the baseline
forward-backward algorithm architecture, the tree-SISO
architecture reduces latency from O(N) to O(lg N). Alternative
tree-structured SISOs, such as the FBT-SISO, trade a linear
increase in latency for substantially lower complexity and area.

An efficient SISO design generally is not built using a
single tree-SISO, but rather using tree-SISOs as important
components. For example, as described above, many tree-SISOs were
used to build a SISO using tiled sub-windows. Latency in that
example was reduced from linear in the minimum half-window size
(d) for fully-parallel tiled architectures based on the
forward-backward algorithm, to logarithmic in d for tiled
tree-SISOs.

In general, potential latency advantages of the tree-SISO
are clearly most significant for applications requiring large
combining windows. For most practical designs, this is expected
when the number of states increases. In the one detailed 4-state
tiled-window example considered, the latency was reduced by a
factor of approximately 4. For systems with binary inputs and S
states, one would expect that d = 8 lg(S) would be sufficient.
Thus, there is a potential reduction in latency of approximately
8 lg(S) / lg(8 lg S), which becomes quite significant as S
increases. However, the major challenge in achieving this
potential latency improvement is the area required for the
implementation. In particular, building a high-speed S-way ACS
unit for large S is the key challenge. Techniques to reduce this
area requirement without incurring performance degradations
(e.g., bit-serial architectures) are promising areas of research.
In fact, facilitating larger S may allow the use of smaller
interleavers, which alleviates the area requirements and reduces
latency.
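
The potential reduction factor 8 lg(S) / lg(8 lg S) can be
evaluated directly for a few state counts. The short Python
sketch below does so; the function name `latency_ratio` and the
sample values of S are chosen for illustration.

```python
from math import log2


def latency_ratio(num_states):
    """Approximate latency reduction 8 lg(S) / lg(8 lg S)."""
    d = 8 * log2(num_states)  # assumed sufficient half-window size
    return d / log2(d)


for s in (4, 16, 64, 256):
    print(s, round(latency_ratio(s), 2))
```

For S = 4 the ratio is 4.0, in agreement with the factor of
approximately 4 observed in the 4-state tiled-window example, and
for S = 16 it is 6.4.
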

Various implementations of the systems and techniques
described here may be realized in digital electronic circuitry,
integrated circuitry, specially designed ASICs (application
specific integrated circuits) or in computer hardware, firmware,
software, or combinations thereof.

A number of embodiments of the present invention have been
described. Nevertheless, it will be understood that various
modifications may be made without departing from the spirit and
scope of the invention. Accordingly, other embodiments are within
the scope of the following claims.






